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Abstract: We investigate Monte Carlo updating algorithms for simulating SU{N) Yang- 
Mills fields on a single-site lattice, such as for the Twisted Eguchi-Kawai model (TEK). 
We show that performing only over-relaxation (OR) updates of the gauge links is a valid 
simulation algorithm for the Fabricius and Haan formulation of this model, and that this 
decorrelates observables faster than using heat-bath updates. We consider two different 
methods of implementing the OR update: either updating the whole SU(N) matrix at 
once, or iterating through SU(2) subgroups of the SU(N) matrix, we find the same critical 
exponent in both cases, and only a slight difference between the two. 
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1 Introduction 

As originally proposed by t’Hooft [1], the large N limit of SU(N) Yang-Mills (YM) theories 
at fixed t’Hooft coupling is an approximation to the model of strong interactions. Being 
simpler from an analytic point of view, it was hoped that it could lead to an understanding 
of the properties of hadrons at low energies. Despite the fact that only a subset of Feynman 
diagrams contribute in this limit, these initial hopes turned out to be too optimistic. Large 
N YM theories are full of rich phenomena but complicated enough to resist a complete 
understanding. In parallel, its interest has grown and extends much beyond its role as an 
approximation to strong interactions. The t’Hooft limit appears as an important ingredient 
in many recent developments, such as the connections of field theory with string theory 
and/or gravitational interactions as in the AdS/CFT construction. 

Given its interest, several authors have used lattice gauge theory techniques to study 
the non-perturbative behaviour of YM theories in the large N limit (for a recent review 
see [2]). This program is challenging, since the number of degrees of freedom that have to 
be simulated in a computer increases with the rank of the group, making the simulations 
more difficult with growing N. Alternatively, one can exploit the idea of volume reduction. 
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The study of the Schwinger-Dyson equations of U(N ) YM theories led to the conjecture 
that gauge theories become volume independent in the large N limit [3]. Taking the idea 
of volume reduction to an extreme, one can simulate a lattice with one single point. This 
allows one to reach much larger values of N and can provide a more precise method of 
determining the large N observables. 

Attempts to produce a successful implementation of the volume independence idea 
have lead to several proposals [4-12] after the original one was shown not to work [4] (see 
also Ref. [2] for a recent account). Our study here will focus on the twisted reduction 
idea originally proposed by some of the present authors [5], but its scope applies to other 
reduced models as well. In particular, the one-site twisted Eguchi-Kawai model( TEK) [6] 
with fluxes chosen in a suitable range [7] has been tested recently in several works. There is 
now strong numerical evidence that its results for various lattice observables coincide with 
those of SU(N ) gauge theories extrapolated to N —> oo [13]. Furthermore, the tests have 
also extended to physical quantities in the continuum limit such as the string tension [14] 
or the renormalized running coupling [15]. 

In all the previous works a connection was established between the finite N corrections 
of the twisted model and finite volume effects of the ordinary gauge theory. Typically N 2 
plays the role of the physical volume. Hence, in order to extract physical quantities with 
small systematic errors, one should work at rather large values of N (0(1000)). These 
types of simulations present their own challenges, which are sometimes very different to 
the ones we are used to face in usual lattice simulations. The literature is somewhat scarce 
and old (this will be reviewed in the next section). This serves as a motivation for the 
present work, devoted to the analysis of the computational aspects of this one-site model. 
We will present the different algorithms that can be used in the simulations, and numerical 
tests of the correctness and efficiency of these algorithms. 

We must conclude by mentioning that the interest of the reduced matrix models ex¬ 
tends beyond their purely computational one. First of all, the methodology developed 
here is useful for extensions of pure Yang-Mills theories such as the model with fermionic 
fields in the adjoint representation with various types of boundary conditions [9, 16]. This 
allows the study of a possible candidate for walking technicolor and the determination of 
its anomalous dimensions [17, 18]. There are also other intriguing phenomena, such as 
a stronger form of N-L scaling [19], [20], the emergence of new symmetries [10] or the 
connections with non-commutative space-time [21-24], In summary, we are quite confident 
that the methodology studied here will be useful for future researchers in the field. 

2 Update algorithms for the one-site model 

The action of the TEK model in d space-time dimensions is defined as a function of the 
variables (fi = 0,..., d — 1) that are elements of the group SU(N) [6] 

Stek[U] = bN J2 Tr (u - . (2.1) 
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where b can be interpreted as the inverse of the ’t Hooft coupling A = g 2 N, and z lw is 
given by 


= exp I - 


( 2-k in 




N 


( 2 . 2 ) 


where n „„ (called the twist tensor) are integers defined modulo N. The action Eq. (2.1) is 
just the Wilson action of an SU(N) lattice gauge theory defined on a one-site lattice with 
twisted boundary conditions. 

The partition function of this model is given by 


z= / V[U] exp{-5 TE K[C/]} , 


(2.3) 


and in principle can be simulated with the usual Monte Carlo techniques like, for example, 
the metropolis algorithm [25]. Nevertheless the most effective algorithms used to simulate 
pure gauge theories [26-31], a combination of heatbath (HB) and overrelaxation (OR) 
sweeps, cannot be applied directly to this model. The reason is that the action Eq. (2.1) 
is not a linear function of the links U^. The solution to this problem was discovered long 
ago by Fabricius and Haan [32] and consists in introducing auxiliary normally distributed 
complex N x N matrices Q^ with ji > v. The original partition function Eq. (2.3) can be 
written as 

Z = Af j V[U]V[Q ] exp{-5 TE K[C/]} exp j~ J^Tr (q^Q^) j , (2.4) 

where Af is a constant. If we perform the change of variables 

Qnu = Qu,v - t^U^Uv - UnUvUn , V = e nin ^ IN V2Nb , (2.5) 


the partition function Eq. (2.3) can be written as 


Z=M 


V[U]V[Q] exp | -5 TE kq[C/, Q] ~ \ £ Tr (q\ v Q^) 


fj.>v 


where 1 the modified action S'tekq given by 


•S’tekqI^, Q] 


ReTr Q\ w + t UfJ U u U^ 


)1>V 


( 2 . 6 ) 


(2.7) 


is linear in the link variables U, t . 

At this stage standard algorithms like heatbath and overrelaxation can be applied 
to update the links. The Fabricius-Haan algorithm thus involves the updating of the 
auxiliary variables followed by the link updates. We will describe below several updating 
possibilities and show that the optimal algorithm only requires over-relaxation updates of 
the links combined with the updating of the auxiliary variables. 

1 Note that the Jacobian of the change of variables Eq. (2.5) is just one. 
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2.1 Update of the link variables 

The terms of the action that involve some link U a are 

S a [U a ] = -ReTv (U a H a ) , (2.8) 

where H a is the sum of staples given by 

Hot = tavUvQav + tvaQa V Uv , (2-9) 

V^OL 

and we have defined Q au = Q ua for a <v. 

Each link U a can be updated using a combination of heatbath and overrelaxation 
updates. The heatbath updates are performed by projecting into 577(2) subgroups as 
described in [26, 27, 32]. The overrelaxion updates can be implemented through sequential 
SU(2) updates [28-31] or by performing an overrelaxation update of the whole SU(N) 
matrix [8, 33]. 


2.1.1 577 ( 2) projections 

We introduce SU(2) subgroups 'H(ij) C SU(N) labelled by two integers (i, j) such that 
i 7 ^ j. An element A of the subgroup can be written as 


A e //, 


(.id) 


Aki = 


&ki k11 7^ it j 
aki otherwise 


( 2 . 10 ) 


with the 2x2 submatrix aki being an SU( 2) element. 

The update of an SU (. N ) matrix proceeds by choosing a set S of these SU (2) subgroups 
of SU(N) [25, 27] 

S = (2.11) 

such that no subset of SU(N) remains invariant under left multiplication by S. This 
requirement can be satisfied with k = N — 1 as follows 


^minimal — { 77 ( z _ «)}«.»-i - < 2 - 12 ) 

However covering the full group requires order N steps of this kind. Hence, as experience 
shows, it is more effective to choose S to be composed of all the N(N — l)/2 subgroups of 
this type 

5 = (»(«)),<, • ( 2 - 13 ) 

If A G 'Hu j) is a matrix of the type Eq. (2.10), it is easy to evaluate the one-link action 
Eq. (2.8) 

S a [AU a ] = —TV (aw) + terms independent of a . (2-14) 

Here a and w/\w\ are 577(2) matrices with w = Re(wo) +iRe(wi)Ti obtained from the 2x2 
complex matrix given by 

w = w 0 + iwiTi = ( ) , W = U a H a . (2.15) 

\ W ji W 33 J 
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A heatbath update [32] generates the matrix a with probability e Sa . An overrelax¬ 
ation update generates a by reflecting around the matrix w 

a = (w^) 2 /\w\ 2 . (2.16) 

In both cases the matrix A is then used to update the link U a 

U a -> AU a . (2.17) 

We will call a sweep an update for each of the links by all of the 51/(2) subgroups in 
S (Eq. (2.13)). A sweep therefore consists of the left multiplication of the link U a by 
k = N(N — l)/2 matrices of the form Eq. (2.10) for each a = 0,..., d — 1. 

It is important to note that an overrelaxation update does not change the value of 
Eq. (2.14), and hence the value of the action Stekq[U, Q] remains invariant. 

2.1.2 U(N) overrelaxation 

The authors of [8] proposed to perform an overrelaxation update of the whole SU(N) 
matrix at once. The starting point is again the link U a which we want to update according 
to the probability density 


P(U a ) oc exp (—ReTr [U a H a ]) ■ (2-18) 

where H a is the sum of staples defined in Eq. (2.9). The SU(N) overrelaxation step consists 
of a proposed update of the form 

U™ W = WaU^Wa ■ (2.19) 

where W a is an SU(N) matrix that does not depend on U a or t/j) ew . The change is 
accepted or rejected with probability P(U^ ew )/P(U a ). This is a valid update algorithm, 
but its efficiency depends on the acceptance rate. To maximize it one should minimize the 
change in the action caused by the proposed update. In fact, as we will see, if the link 
variables were elements of the U(N) group, the procedure described in [8, 33] results in an 
exact microcanonical update (i.e. the action Stekq[£4 Q] is unchanged by the update). 
The construction is as follows. Given the singular value decomposition of the staple 

H a = X Q aV^ , (2.20) 

where a is diagonal and real, and X,V are U(N ) matrices, we construct the matrix 

W a = E a At . (2.21) 

This matrix is an element of U(N ) by construction, and it is a straightforward exercise to 
check that the action <Stekq Wi Q\ unchanged by the transformation of Eq. (2.19). 

All updates referred to as ORN in the rest of this paper use this method. 

The algorithm is applicable to one-site models directly because for them there is no 
difference between the U(N) and SU(N) gauge models. This is obvious since any possible 
phase factor e ia of the links U fl cancels in the evaluation of the action Eq. (2.1). Moreover 
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Algorithm 1 Update algorithm for one site models. 

1: for a = 0, d — 1 do 

2: Generate d — 1 auxiliary variables Q a/1 according to Eq. (2.5) 

3: Compute the Staples H a according to Eq. (2.9) 

4: Update U a using HB/OR2/ORN 


if the observables of interest are center-invariant (i.e. any Wilson loop-like quantity), the 
phases also do not play any role in the evaluation of these observables. This statement can 
be made explicit in the partition function by introducing another set of real variables a^ 
uniformly distributed in (0, 27r) and writing the probability measure as 



After the change of variables the action of the SU(N) TEK model transforms 

into that of the U(N ) TEK model. We stress that this is a particularity of the one-site 
model, and in that in general SU(N) and U(N) gauge theories on the lattice differ by a 
0(1/N 2 ) contribution. 

Given that here we are concentrating on one-site lattice models we will stick to the 
previously defined microcanonical U(N) overrelaxation update for the rest of this paper. 
However, for models in which not all directions are fully reduced the equivalence of SU(N) 
and U(N) models does not apply. For the sake of the readers we will develop upon the 
SU(N) case in the appendix A. 

2.2 A new update algorithm for the one-site model 

Monte Carlo algorithms for these type of models consists of two elements: the update of the 
auxiliary matrices followed by any combination of the possible updating procedures 
described in the previous section: heat-bath (HB), SU(2) overrelaxation steps (OR2) and 
U(N) overrelaxation steps (ORN). 

The new update algorithm that we will propose and analyze numerically in this paper 
can be written in a few lines (algorithm 1 with only over-relaxation updates). It requires 
generating d(d— 1) auxiliary matrices per sweep combined with only overrelaxation updates 
for the link update. Despite the absence of any HB update this is a valid algorithm since 
it satisfies detailed balance and ergodicity. The latter is proven in the next subsection. 

Since in the one-site model the over-relaxation updates are applied to the action 
‘S'tekq [U, Q\, it is the value of this action that is unchanged by over-relaxation updates, but 
the original action of the TEK model <Stek[H] does change by the combination of intro¬ 
ducing the auxiliary variables and performing an over-relaxation sweep. The situation has 
some similarities with the HMC algorithm, where one also introduces auxiliary variables 
(the random momenta), and performs an update that leaves the Hamiltonian unchanged. 

2.3 Ergodicity of over-relaxation updates 

Here we will prove that the generation of the auxiliary variables followed by an overrelax¬ 
ation update allows to reach the full space of unitary matrices with non-zero probability. 
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The key ingredient in the proof is that the mapping from the auxiliary matrices to the 
staples H a , Eq. (2.9), is a surjective map from the space of complex N x N matrices onto 
itself, except in some exceptional situations. If we consider one particular direction a, we 
have 

H a = £ U u Ql u + t ua Ql„U v , (2.23) 

which is a linear map of vector spaces. To show surjectivity it is enough to consider only one 
term v in the previous sum. Dropping indices for that case, the transformation becomes 

H = tUQ^ + t*Q Jf U. (2.24) 


where t*/t = z = ex.p(2nin/N). Surjectivity amounts to invertibility of this transformation, 
which implies that the kernel should vanish. By means of invertible transformations this 
problem maps onto the invertibility of the following transformation: 

Q' = P(Q) = Q + zU^QU (2.25) 


This can be easily shown using the basis in which U is diagonal with eigenvalues e lSa , where 
the previous expression reads 


Q' ab = 


ze 


-l(6a~6 b ) 


Qab : 


(2.26) 


Thus, the map has a well defined inverse whenever the expression in parenthesis does not 
vanish for any values of the indices a,b. Vanishing can only occur if z = — 1 (since in this 
case Q' aa = 0) or in a set of zero measure (i.e. when two different eigenvalues of U obey 
some special relation). So that, except for the z = — 1 case which will be commented later, 
the transformation can reach an arbitrary N x N complex staple matrix. 

Let us summarize. We first introduced the auxiliary variables with Gaussian 
probability. Then we performed a shift to obtain the Q^ variables. Thus, all sets of non¬ 
zero measure will have a non vanishing probability. By the previous proof we showed that, 
except for the exceptional cases mentioned earlier, this will induce a probability distribution 
in staple space given by Borel and strictly positive measure (i.e. every non-empty open 
subset on the space of N x N matrices has a positive measure). The next step, explained 
below, will be to show that both OR2 and ORN updates can then produce an arbitrary 
U ( N ) matrix with a non-zero probability. 


2.3.1 ORN updates 

Let us first study the case of an ORN update given by Eq. (2.19). In this case we perform 
the SVD decomposition of the staple matrix 

H = XcrV f (2.27) 


and update U according to 


U->WrfW (W = VX^). 


( 2 . 28 ) 




We only need to show that we can generate any W with non-zero probability. This is easily 
seen recalling that we can write 

H*H = (V<tV*) 2 , (2.29) 

and except in a set of zero measure the matrix H is invertible. Now the map 

GL(N, C) ->■ U(N) 

H W f = H/VlUH, (2.30) 

is continuous and therefore the inverse map sends an open neighborhood of W G U(N) 
into an open neighborhood of H E GL(N, C). Since the measure on GL(N, C) was strictly 
positive (i.e. any non-empty open set has measure bigger than zero) we have a non-zero 
probability of generating any U(N ) matrix. 

2.3.2 OR2 updates 

An OR2 update consists in successive left-multiplication of the link matrix U by matrices 
Auj\ E kL(i.j) ■ First let us show that the matrix is an arbitrary matrix of T~Luj)- If 

we call P(tj) the projectors onto the subspace we have 

A (ij) = p (i,j) uHp (i,j) »• ( 2 - 31 ) 

and since H is arbitrary, so is UH and therefore Au^ is an arbitrary element of 'H(ij)- 
The full update is given by successive multiplications 

A = H A {id ). (2.32) 

i<j 

Since both the multiplication and the projection to SU (2) are continuous, the application 

H A, (2.33) 

is continuous. Moreover we cover all SU (2) subgroups of SU (A), and the map is surjective. 
Therefore the inverse image of an open neighborhood of W G U(N) is an open non-empty 
subset on the space on matrices and again we have a non-zero probability of generating 
any SU(N ) matrix. 

2.3.3 The singular case and partially reduced lattices 

In a typical simulation of the TEK model, the singular case (z = —1) can only happen in 
a very particular situation, since in the TEK simulations the rank of the group matrices 
is usually taken a perfect square N = L 2 and the twist tensor |n MJ ,| = kL with k and L 
co-prinre. It is easy to see that these conditions imply that our singular case can only 
happen with N = 4 ,k = 1. 

Nevertheless these are sufficient conditions for the algorithm to be ergodic, but not 
necessary. In fact the previous proof shows something much stronger than ergodicity: that 
with a single sweep we can go from any configuration to any other with non-zero probability. 



We have performed some extensive simulations of the worst case (N = 4, k = 1) with 
0( 10 6 ) measurements, and found that both heatbath and over-relaxation thermalize to 
the same values starting both from a cold or hot configuration, and expectation values are 
consistent within errors. Moreover, we have not observed any significant dependence of the 
autocorrelation time with the value of z flu that could indicate a loss of ergodicity. 

In any case, if the reader is interested in simulating one of the exceptional situations 
one can simply perform a heatbath sweep from time to time to mathematically guarantee 
the correctness of the algorithm even in the singular case. 

We also want to point out that the above proof of ergodicity also applies to lattices 
in which at least two directions are reduced [34]. In this case the auxiliary variables are 
introduced only for the reduced directions. Since at least one term in the computation of 
the staples will have a contribution coming from the auxiliary variables, the staples will 
also be arbitrary in this case, and our proof applies. The numerical study of this case will 
not be covered in this work. 

2.4 Frequency of the update for the auxiliary variables 

In this subsection we will consider possible alternatives to algorithm 1 based on varying 
the relative ratio between the frequency at which the auxiliary variables are generated 
relative to the number of link updates per sweep. For the latter one can use either heat¬ 
bath (HB) or over-relaxation (OR2 or ORN) steps. In this alternative approach, we also 
alter the order in which generation and updates are performed. Thus, in this version (see 
algorithm 2) one generates the full set of (d(d — 1)/2) matrices Q^, and then performs n 
link updates using any of the alternatives. The ratio of Q generations to link updates now 
becomes d(d — l)/2n instead of the d(d — 1) of algorithm 1. 

It should be mentioned that our proof of ergodicity for overrelaxation updates does 
not directly apply to these new algorithms. Here one cannot separate the problem into 
independent directions and has to consider the full linear map from the vector space of 
d(d— l)/2 auxiliary Q matrices to the vector space of d staples. Notice, however, that for 
the algorithm not to be ergodic, the map from a configuration to another must be singular 
(i.e. with an almost anywhere vanishing Jacobian). Taking into account that these maps 
are perfectly regular (i.e. look at Eq. (2.30)), and that for d > 2 there are more auxiliary 
matrices than links, we think that it is quite plausible that algorithm 2 with only OR 
updates is ergodic as well. This is also supported by the numerical evidence that we have. 
A formal proof might not be hard to find, but given our preference for algorithm 1, based 
on the results given below, we did not make the effort to include it in the paper. 

In conclusion, our comparison of the two alternative algorithms will be based on the 
performance analysis that will follow. 

To make a comparison, the first thing to examine is the time needed to update the 
auxiliary variables. Generating each auxiliary variable u requires the generation of the 
random matrix Q^ u and two matrix multiplications (see Eq.(2.5)). Since generating the 
random numbers requires 0(N 2 ) operations, while matrix multiplication requires 0(N 3 ) 
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Algorithm 2 Alternative update algorithm for one site models. As is discussed in the 
text, keeping the auxiliary variables for many updates results in a worse performance. 

1: Generate d(d — l)/2 auxiliary variables Q^ v according to Eq. (2.5) 

2: for i = 1, n do 
3: for a = 0, d — 1 do 

4: Compute the Staples H a according to Eq. (2.9) 

5: Update U a using HB/OR2/ORN 


operations, we will neglect the time needed to generate the variables Q^ v 2 . On the 
other hand the computation of the staples attached to one link requires 2{d — 1) matrix 
multiplications in d dimensions. In particular for d = 4, each update of the auxiliary 
variables requires one third of the time required to compute the staples. Moreover all of 
the previously described algorithms also require 0(N 3 ) operations. In practical situations, 
we have measured that computing the staples takes about the same CPU time as an update 
sweep. 

Although, the ratio of generations over link updates per sweep d(d — l)/2 n is smaller 
for the new algorithms, from the numerical point of view the gain is only marginal, since 
the generation of the auxiliary variables takes 0(10%) of the computer time of a typical 
update. 

If from a practical point of view this approach results in a better algorithm is a more 
complicated question, where autocorrelation times have to be taken into account. Clearly 
the update of the auxiliary variables is crucial to achieve ergodicity, and therefore if one does 
not update these variables frequently enough one might end up exploring a small region 
of the space of configurations. This is nicely illustrated by looking at the thermalization 
process (Figure 1). As the reader can see, the expectation value seems to “plateau” very 
fast in between the updates of the auxiliary variables. Note that if one looks at the 
combination of introducing the auxiliary variables and n updates, the algorithm is actually 
thermalizing to the correct value, and there is no loss of ergodicity. But the figure indicates 
that performing n > 1 updates in between the introduction of the new auxiliary fields is 
redundant and does not improve the thermalization at all. 

Since the Markov operator is the same during thermalization as during equilibrium 
updates, this suggests that autocorrelations might be significantly reduced only by the first 
update after introducing the auxiliary fields. This is in fact the case, as the data of Table 1 
shows. We can see that the most efficient algorithm is our original proposal (algorithm 1), 
and that keeping the same set of auxiliary variables does not help in decorrelating the 
measurements. 

2 This is a good approximation for the values of N we consider. For example, when N > 400 the difference 

between using different random number generators (RNG) becomes negligible even when one RNG is 2-5 
times faster than the other. 
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Plaquette Thermalisation: N=121, b=0.36 


0.9 


0.5 


Algorithm (1) 
Algorithm (2) with n=1 ORN updates 
Algorithm (2) with n=2 ORN updates 
Algorithm (2) with n=5 ORN updates 
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200 


Figure 1: Comparison of the thermalization of the plaquette for N = 121, b = 0.36, k = 3 
for different frequencies of Q-updates. It is clearly beneficial during thermalization to 
update the auxiliary valiables Q as frequently as possible. 


Update algorithm 

WM)) 

Tint 

Algorithm (1) 

0.558093(26) 

6.3(3) 

Algorithm (2) with n = 1 ORN updates 

0.558079(28) 

7.3(4) 

Algorithm (2) with n = 2 ORN updates 

0.558044(30) 

8.5(5) 

Algorithm (2) with n = 5 ORN updates 

0.558068(36) 

12.4(1.4) 


Table 1: lxl Wilson Loop average values and integrated correlation times for N = 121, 
b = 0.36, k = 3 with 2.5 x 10 5 sweeps, for different frequencies of Q-updates. It is clear 
that doing more ORN updates at fixed Q in fact makes the autocorrelations worse. The 
integrated autocorrelation time is defined in Eq.(3.2) 

2.5 Simulation algorithms and vectorization 

Let us finally comment on a few points of the previously described algorithms. Lattice 
simulations usually make use of massive parallel machines, but the simulation of one-site 
models is also challenging in this respect. The link variables U a are dense matrices, and 
a distributed approach seems unfeasible. Nevertheless one can make use of the vector or 
multi-core structure of the CPU : 

SU( 2) Projections This update allows an easy vectorization. Some of the N(N — l)/2 
subgroups of the set S of Eq. (2.13) can be updated in parallel. For odd N, there are 
l = (N — l)/2 SU( 2) subgroups 'Huj) that can be updated simultaneously. In the 
case of even N, there are l = N /2 subgroups that can be manipulated simultaneously. 
For example, for N = 5, we have the following 5 sets of two subgroups that can be 
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updated simultaneously: 

{^( 1 , 2 )) %( 3 , 4 )}; {^( 1 , 3 )! ^( 2 , 5 )}> {^( 1 , 4 ) 5 ^( 3 , 5 )}> {^( 1 , 5 ) >^( 2 , 4 )}> {^( 2 , 3 ) >^( 4 , 5 )} • 

(2.34) 

Moreover the over-relaxation updates based on SU (2) projections do not need to gen¬ 
erate any random numbers, something that might be convenient for parallelization. 

U(N) over-relaxation The vectorization of the U(N ) over-relaxation updates basically 
requires writing an SVD routine that makes use of the vector/multi-core structure 
of the CPU. Some LAPACK implementations have support for multi-threading, and 
represent an alternative way to profit from the multi-core structure of the CPU. 

3 Numerical comparison of algorithms 

Since the purpose of any simulation is to compute the expectation values of some observ¬ 
ables with the highest possible accuracy, the merit of an algorithm has to be evaluated by 
comparing the uncertainties of some observables per unit of CPU time. There are only 
two factors that have to be taken into account in our particular case: the CPU time per 
update sweep, and the autocorrelations of the measurements. 

After having introduced the auxiliary variables, and transformed the action in a lin¬ 
ear function of the links, one can use different updating methods: heatbath (HB) and 
over-relaxation (OR2) based in the projection onto 517(2) subgroups and the U(N) over¬ 
relaxation (see Algorithm 1). Since all three sweep methods have to compute the same 
staples and generate the same auxiliary variables Amdahl’s law puts a limit on the theo¬ 
retical improvement that one sweep method can have over the others. 

In principle OR2 sweeps are very simple and do not require to evaluate any mathe¬ 
matical function like cos, log,..., but we have observed that the CPU time for a HB sweep 
is essentially the same as the CPU time required for an OR2 sweep. 

The case of ORN is more difficult to evaluate, and depends crucially on the implemen¬ 
tation of the SVD decomposition. With the standard BLAS/LAPACK implementation, the 
time per sweep is roughly the same as the time required for an HB/OR2 sweep (faster for 
N > 300, slightly slower for small values of N). But we have observed that some LAPACK 
implementations of the SVD decomposition (like intel MKL) run up xlO faster for large 
N. This is mainly due to the fact that the prefetching of data and block decomposition 
used by many BLAS/LAPACK efficient implementations actually sustain the manifest 0(N 3 ) 
scaling of the algorithm up to very large values of N (when the matrices are actually several 
hundreds of MB in size). 

In any case all these update methods scale like 0(N 3 ), and hereafter we will assume 
that the three updating methods (HB/OR2/ORN) have the same cost per sweep. This re¬ 
duces the question of the most efficient algorithm to the one that decorrelates measurements 
faster. Nevertheless the reader should keep in mind that if an efficient implementation of 
BLAS/LAPACK is available for the architecture they are running, there might be a saving in 
CPU time. 
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To make the comparisons we have chosen different observables. Firstly Wilson loops 
of different sizes; W(l, 1), W( 3,3) and W(5, 5). These are the classic observables that have 
been used in the past for several studies. 

Secondly, smeared quantities that are known to have large autocorrelation times. In 
particular quantities derived from the gradient flow have been used to test how the slow 
modes of algorithms decorrelate [35]. We will consider the renormalized coupling A tgf [36] 
as defined for the TEK model in Ref. [15]. We will tune b for various values of N, such 
that A tgf — 23, and hence the physical volume (ay/N) is kept approximately constant. 
This will allow us to study how the autocorrelation time of a physical quantity scales as 
we approach the continuum limit. 

Measuring flow quantities is very expensive for the case of the TEK model. Compared 
with a typical update, the integration of the flow equations can consume 0( 10 5 ) times more 
computer time. This is easy to understand if one takes into account that the measurement 
involves the computation of 0(1O 4 ) times the exponential of a matrix. Usually it is not 
a huge problem to measure these observables with a high accuracy, since the observables 
have a very small variance and one can easily run many parallel replicas. But in order 
to measure autocorrelation times we actually need to collect at least 0(lOOr in t) measure¬ 
ments on the same Monte Carlo chain. Since r; n t can easily go up to 200, this task is 
actually very difficult. We address this difficulty in two ways. First we measure the flow 
quantities frequently enough to actually measure the correlation between the data, but not 
too frequently. We aim at measuring integrated autocorrelation times of the order of 10. 
The values that we will present for Tj n t are then computed by scaling with the number of 
updates in between measurements. Second we use a large step size to integrate the flow 
equations. We use the third order Runge-Kuta integrator described in the appendix of [37], 
with a step size of e = 0.05. We explicitly checked the size of the systematic error in the 
measured coupling due to the numerical integration on several configurations for each value 
of N. In all cases the size of this relative error on the measured coupling was below 0.005%, 
moreover this error decreased with increasing N, so the systematic error due to the large 
integration step size is negligible compared to our 0(0.5%) statistical errors. 


3.1 Comparison of link update algorithms and scaling 

Our Monte Carlo simulations produce correlated estimates of the observables of interest. 
The correlation between measurements can be quantified by Ti n t. We follow the notation 
and conventions of [38] and define the autocorrelation function of the observable O as 

Fo(t) = ((0(t) - O)(O(0) - O)), (3.1) 


were the simulation time t labels the measurements and O is the average value of our 
observable of interest, we have 


_ 1 

Tint - - 


\ ^ r Q (f) 

k r °( o)' 


(3.2) 


In practical situations the sum Eq. (3.2) has to be truncated after a finite number of 
terms W that defines our summation window. Since the autocorrelation function ro(t) 
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decays exponentially fast for large t, estimates of Tint will be accurate as long as W is 
large compared with the slowest mode of the Markov operator. In this work we find it 
is sufficient to simply truncate the sum when the estimate of r^t shows a plateau in W 
(see figures below), but other situations may require more sophisticated techniques (see for 
example [39]). 

We now move on to comparing the three different update algorithms, HB, OR2 and 
ORN. In Table 2 we show the results from the three update methods of Wilson loop 
measurements at N = 49 with 2 x 10 6 updates for each. Similarly Table 3 and Figure 2 
show Wilson loop measurements at N = 289 with 6 X 10 5 updates for each. All the 
update methods give observables which are in agreement within errors, and the OR updates 
result in observables with about half the integrated autocorrelation time of the HB update. 
However, despite the increased computational complexity of the ORN method compared 
to the OR2 method, it does not result in a significantly smaller integrated autocorrelation 
time. 


b = 0.36 

(W(l,l)> 

Tint 

HB 

0.55690(4) 

14.9(3) 

OR2 

0.55694(3) 

8.7(2) 

ORN 

0.55698(3) 

7.2(1) 


Table 2: Wilson Loop average values and integrated correlation times at N = 49 ,k = 1 
for the three different update methods. 


b = 0.36 

(W(3,3)) 

Tint 

b = 0.50 

<W(5,5)> 

Tint 

HB 

0.032373(15) 

30(2) 

HB 

0.0424222(104) 

10.4(4) 

OR2 

0.032401(13) 

20(2) 

OR2 

0.0424268(78) 

6.0(4) 

ORN 

0.032390(11) 

16(1) 

ORN 

0.0424213(76) 

5.5(2) 


Table 3: Wilson Loop average values and integrated correlation times at IV = 289, k = 5 
for the three different update methods. 

Figure 3 shows the integrated autocorrelation time of A tgf for N = 121 for the three 
update algorithms. We see the same results as for the Wilson loop observables; OR updates 
have around half the of HB updates, with little difference between OR2 and ORN. Since 
we have tuned the physical volume to be constant, we can repeat these measurements at 
different values of N to determine how the integrated correlation time scales as a function 
of the lattice spacing a, using the relation y/~N = L/a. The results are listed in Table 4 for 
the three update methods, and Tint as a function of a -2 is shown in Figure 4. 

The generic expectation in normal lattice simulations is for a local update to behave 
like a random walk, and so for r, n t to scale like a -2 . A priori it is not clear that this 
should also apply to the TEK model, since in this case a “local” update of a single gauge 
link actually involves all the degrees of freedom in the lattice. It turns out however that 
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T int of W(3,3) at b=0.36, N=289 



W 


of W(5,5) at b=0.50, N=289 




***** 


wbBBBS 




100 200 300 400 500 600 


Figure 2: Integrated autocorrelation time estimates as a function of the window size 
(see Eq. 3.2) for the 3x3 Wilson Loop at IV = 289,6 = 0.36 (left), and the 5x5 Wilson 
Loop at IV = 289,6 = 0.50 (right). Different symbols correspond to different updates 
(HB/OR2/ORN). We take Tint as the plateau value. 



HB 

OR2 

ORN 

N 

6 

k 

A TGF 

Tint 

A TGF 

Tint 

A TGF 

Tint 

100 

0.358 

3 

22.3(1) 

142(11) 

22.3(1) 

74(4) 

22.0(1) 

65(6) 

121 

0.360 

3 

22.8(2) 

160(16) 

22.8(1) 

91(8) 

22.6(1) 

68(6) 

169 

0.365 

4 

24.3(2) 

224(20) 

24.2(2) 

139(15) 

24.3(2) 

104(9) 

225 

0.372 

4 

21.8(1) 

367(32) 

21.9(1) 

167(14) 

21.8(1) 

134(12) 

324 

0.375 

5 

24.2(2) 

426(53) 

24.1(2) 

214(23) 

24.0(2) 

195(20) 


Table 4: A tgf — 23 average values and integrated correlation times at each N for the 
three different update methods. 

all three update methods exhibit the same critical scaling as one would expect for a local 
algorithm, namely Tint ~ a -2 , as can be seen in Figure 4. 

4 Conclusions 

We have studied several simulation algorithms for one-site lattice models. In particular we 
have focused on the TEK model, which is relevant in the context of the large N expansion 
of gauge theories and has recently been used to compute many interesting properties of 
SU(oo) gauge theories [14, 15, 17]. 

Following Fabricius and Haan [32] we introduce auxiliary variables to make the action 
a linear function of the links, and study several link-update algorithms. Up to now all 
authors included a combinaton of heat-bath steps and overrelaxation steps in their algo¬ 
rithms. However, we show that this is not necessary, since once the auxiliary variables are 
updated, overrelaxation alone suffices to make the algorithm ergodic. This is in contrast 
with the usual lattice gauge theory, where over-relaxation sweeps produce microcanonical 
movements and are therefore not ergodic. 
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t^W) for A, T q F coupling: c=0.3, N=121, b=0.36 



Figure 3: Integrated autocorrelation time estimates as a function of the window size 
(see Eq. 3.2) for the Twisted Gradient Flow coupling vs W, for A tgf ~ 23 at N = 121, 
b = 0.36, c = 0.30. OR updates are significantly better than HB, with no significant 
difference between OR2 and ORN. We take Ti nt as the plateau value. 


T int of ^TQF=23 VS N 



N = L 2 /a 2 


Figure 4: ORN Integrated autocorrelation time Ti n t vs N = (L/a) 2 , at fixed physical 
volume, A tgf — 23. All three updates appear to scale with the same critical exponent 
Tint ~ a ~ 2 


Regarding the over-relaxation updates, we study two kinds. First the one based in 
the projection over 51/(2) subgroups (OR2). Second we study the OR method over the 
whole group as proposed in [8, 33] (ORN). Indeed, we realize that for one-site models the 
algorithm does not change the value of the action, making a Metropolis accept/reject step 
unnecessary. This is due to the equivalence of U(N) and SU(N) groups for these models. 

Finally, we perform a performance comparison between the different alternatives. We 


- 16 - 












































show that, at different values of N for different observables, overrelaxation sweeps decor¬ 
relate faster than heatbath sweeps (t® r /t rb ~ 1/2). We see no big differences in terms of 
autocorrelation times between the two possible overrelaxation methods (OR2 and ORN). 
Each algorithm has his own benefits. OR2 is simple and easy to vectorize. On the other 
hand ORN might profit from a highly optimized routine for matrix operations, in partic¬ 
ular the SVD decomposition. We conclude by studying the performance for computing 
a renormalized quantity in the continuum limit, the twisted gradient flow renormalized 
coupling. We see that the three tested link-update algorithms scale like a~ 2 . Hence, none 
of them has a better critical exponent. 
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A SU(N) overrelaxation 

Let us recall that the overrelaxation transformation takes the form: 



For the U(N ) case one takes W a = V a X' a , where X and V are U(N ) matrices obtained 
from the SVD decomposition of the staple: 



However, this transformation is not valid for SU ( N ) since 

det[V a Xl] = e+ 1. 

Following Ref. [8], we will propose the SU(N) matrix 

W' a = V a D a X t , 

where D = diag(e 101 ,..., e ldN ), with the condition 

9i = <J> mod 2ir 


(A.3) 


(A.4) 


(A.5) 


The simplest way to implement the constraint would be to take 0* = <h/A, as done in 
Ref. [8]. This is then followed by an accept/reject Metropolis step as explained earlier. We 
find this works reasonably well in practice, with an acceptance rate of ~ 85% at b = 0.36 
for a range of values of N from 81 to 1369, with no noticeable dependence on N, and with 
higher acceptance rates at weaker values of the coupling. 

As suggested in Ref. [33], the acceptance rates can be improved by tuning the angles 
Qi to minimize the quantity ReTr [W a H a \, i.e. choose 0* such that 



(A.6) 


Since the minimization has to be done preserving the constraint Eq. (A.5), the best way is 
to add a lagrange multiplier A enforcing the condition in the minimization function: 



(A.7) 


The equations for minimum lead to the solution 


9i = arcsin — 

cq 

where the quantity A satisfies the equation 


(A.8) 



(A.9) 


The latter is a single transcendental equation, which can be solved by any standard tech¬ 
nique, like Newton-Raphson. We have observed that this procedure converges very fast, 
and in fact increases the acceptance rate to about ~ 98%. 
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